Context summarization feature implementation #3621
Conversation
This looks really great! Very clean and easy to understand. Nice work 👏
I'm going to do some testing and will let you know if I find anything.
@@ -0,0 +1,307 @@
# Code Cleanup Skill
I think this is fine to start. We will probably want to iterate until we land on something optimized.
markbackman left a comment
LGTM! Thanks for the clean up.
From the first review, the only item I still see open is this one (regarding the base summary prompt):
https://github.com/pipecat-ai/pipecat/pull/3621/changes#r2775817727
Aside from that, this looks good to go. It probably makes sense to get input from someone else too since this is such a key feature and will get a ton of use.
I had missed that one. Fixed. Thank you for the review @markbackman. 🙌
class LLMContextSummaryRequestFrame(ControlFrame):
    """Frame requesting context summarization from an LLM service.

    Sent by aggregators to LLM services when conversation context needs to be
Might also be worth describing what the LLMs then do with that summary (i.e. push an LLMContextSummaryResultFrame, right?)
I was describing what they do inside LLMContextSummaryResultFrame.
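For orientation, here is a minimal sketch of what this request/result frame pair might look like; the field names are taken from the diff and the discussion in this thread, and the exact definitions in the PR may differ.

```python
from dataclasses import dataclass
from typing import Any

from pipecat.frames.frames import ControlFrame


@dataclass
class LLMContextSummaryRequestFrame(ControlFrame):
    """Sketch: sent by the aggregator to ask the LLM service for a summary."""

    request_id: str
    context: Any  # the full LLM context containing the messages to summarize
    min_messages_to_keep: int  # recent messages to preserve uncompressed
    max_context_tokens: int  # max_tokens enforced when running inference


@dataclass
class LLMContextSummaryResultFrame(ControlFrame):
    """Sketch: pushed back by the LLM service once the summary is generated."""

    request_id: str
    summary: str
    last_summarized_index: int  # last message index covered by the summary
```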
src/pipecat/frames/frames.py
Outdated
context: The full LLM context containing all messages to analyze and summarize.
min_messages_to_keep: Number of recent messages to preserve uncompressed.
    These messages will not be included in the summary.
max_context_tokens: Maximum allowed context size in tokens. The LLM should
Feedback applies throughout: should we be transparent about the fact that these are approximate tokens, maybe by calling it something like max_approx_context_tokens (and updating the docstring comments accordingly so developers know what they're dealing with)?
I think the name max_context_tokens is correct here, because this is what we pass to the LLM when running inference to enforce the maximum number of tokens.
For the one inside LLMContextSummarizationConfig, what the user specifies is only an approximation, since the way we calculate the tokens is approximate. Even so, in my opinion we should keep the same name there, but make it clear in the docstring that the token calculation is an approximation.
What do you think?
Ok, for now I have improved the description of max_context_tokens, inside LLMContextSummarizationConfig, explaining how the tokens are calculated.
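For readers wondering what "approximate" means in practice, a common heuristic is roughly four characters per token; the sketch below is illustrative only and not necessarily the estimate used by this PR's LLMContextSummarizationUtil.

```python
import json
from typing import Any, Dict, List


def estimate_context_tokens(messages: List[Dict[str, Any]]) -> int:
    """Rough token estimate: ~4 characters per token of serialized content.

    Illustrative only; the utility in this PR may use a different heuristic.
    """
    total_chars = 0
    for message in messages:
        content = message.get("content", "")
        if not isinstance(content, str):
            content = json.dumps(content)  # structured/multimodal content
        total_chars += len(content) + len(message.get("role", ""))
    return total_chars // 4
```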
@dataclass
class LLMContextSummaryRequestFrame(ControlFrame):
Q: do we want to consume these frames in the LLMs, or let them continue down the pipeline, just in case anyone wants to handle them in custom processors?
My inclination is that these seem fine to consume in the LLMs...
As we discussed in yesterday's meeting: since we're actually handling the frame and doing something with it (creating the summary), it feels more natural to consume the frame here rather than let it continue.
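A rough sketch of the "consume, don't forward" choice, following the usual pipecat FrameProcessor pattern; the class and helper names here are placeholders, not the PR's actual code.

```python
from pipecat.frames.frames import Frame, LLMContextSummaryRequestFrame  # frame added by this PR
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class SummaryHandlingProcessorSketch(FrameProcessor):
    """Sketch: handle summary requests in place instead of pushing them downstream."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, LLMContextSummaryRequestFrame):
            # Consume the frame: handle it here, intentionally no push_frame().
            await self._handle_summary_request(frame)
        else:
            await self.push_frame(frame, direction)

    async def _handle_summary_request(self, frame) -> None:
        """Placeholder: generate the summary and push an LLMContextSummaryResultFrame."""
        ...
```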
    self._params.context_summarization_config or LLMContextSummarizationConfig()
)
self._summarization_in_progress = False
self._pending_summary_request_id: Optional[str] = None
do we need both of the above state variables? could the presence of _pending_summary_request_id indicate that a summarization is in progress?
Yeah, that could work. 👍
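A small sketch of the simplification suggested here, deriving the in-progress flag from the pending request id (names follow the diff above):

```python
from typing import Optional


class SummarizationStateSketch:
    """Sketch: track only the pending request id; derive 'in progress' from it."""

    def __init__(self) -> None:
        self._pending_summary_request_id: Optional[str] = None

    @property
    def summarization_in_progress(self) -> bool:
        return self._pending_summary_request_id is not None

    def start(self, request_id: str) -> None:
        self._pending_summary_request_id = request_id

    def clear(self) -> None:
        self._pending_summary_request_id = None
```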
for s in self._params.user_mute_strategies:
    await s.cleanup()

async def _clear_summarization(self):
nit: _clear_summarization_state might be clearer
Done.
# Apply summary
await self._apply_summary(frame.summary, frame.last_summarized_index)

def _validate_summary_context(self, last_summarized_index: int) -> bool:
Is this function safeguarding against the possibility of programmatic edits to the context, like from LLMMessageUpdateFrames and the like? If so, then in a PR I worked on recently (which maybe you've looked through already) I added some mechanisms for tracking with more certainty whether a context has been edited...wonder if we could join forces and use those here to determine with more certainty whether a summary still applies.
"""
messages = self._context.messages

# Find first system message (if any)
Probably a very fringe case, but should we handle the possibility of the first system message appearing later than the last summarized index? There's technically no hard requirement that a "system"-role message has to appear at or near the beginning of the conversation, esp. with providers like OpenAI.
Or...does the summarization process already exclude the first system message? (I should probably just read on to find out, but wanted to jot this note down here).
When using OpenAI (where "system"-role messages can appear anywhere), forcing the system message to the beginning on summarization might mess with the conversation flow. But on the other hand, summarization does need to work universally, and other providers don't handle "system"-role messages appearing just anywhere...
With Gemini, we "pull" the first system instruction out of the messages and use it as the overall system instruction (which it seems like the logic here is modeled after). But with AWS Bedrock, we only pull a "system"-role message out of messages and use it as the system instruction if it's the first message. We're inconsistent, which isn't ideal...
As I think about it, two approaches come to mind:
- What you have here
- Only checking the very first message in messages for a system message
Almost always, those two are the same. So in practice I don't know if this makes much of a difference.
But it's a good reminder that we should probably do a consistency pass on how we translate "system"-role messages for different providers.
What I am doing in this PR is finding the index of the first system message and defining summary_start = first_system_index + 1.
I then summarize only the messages that come after the first system message, or everything if there is no system message.
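A sketch of that index computation, assuming messages are dicts with a "role" key as in an OpenAI-style context:

```python
from typing import Any, Dict, List, Optional


def summary_start_index(messages: List[Dict[str, Any]]) -> int:
    """Sketch: summarize everything after the first system message,
    or everything if there is no system message."""
    first_system_index: Optional[int] = next(
        (i for i, m in enumerate(messages) if m.get("role") == "system"), None
    )
    return 0 if first_system_index is None else first_system_index + 1
```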
src/pipecat/services/llm_service.py
Outdated
)

# Calculate max_tokens for the summary using utility method
max_summary_tokens = LLMContextSummarizationUtil.calculate_max_summary_tokens(
For my understanding: the max_context_tokens that the developer specifies as the point where a summary should be triggered is the same number used to compute how big the summary should be?
It seems like how big you want the summary to be should be (at least somewhat) independent—you might want to ask the summary to be relatively compact so you don't have to do it as often, rather than letting it take up all the remaining space, no?
It basically is, but when calculating the available space to define the max_tokens that I pass to the LLM, I always apply a 0.8 buffer to keep the summary at a maximum of 80% of the available space.
But I think you're right, we should probably create a target_context_tokens: the target maximum context size in tokens after summarization.
What do you think?
Done. I have created target_context_tokens.
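For intuition, a sketch of the sizing logic described in this thread (the 0.8 buffer plus the new target_context_tokens); the function signature and the min_summary_tokens floor are illustrative assumptions, not the PR's exact utility.

```python
SUMMARY_BUFFER_RATIO = 0.8  # keep the summary to at most 80% of the available space


def calculate_max_summary_tokens(
    target_context_tokens: int,
    kept_messages_tokens: int,
    min_summary_tokens: int = 128,
) -> int:
    """Sketch: bound the summary so the post-summarization context stays under
    target_context_tokens, leaving room for the messages kept verbatim."""
    available = max(target_context_tokens - kept_messages_tokens, 0)
    return max(int(available * SUMMARY_BUFFER_RATIO), min_summary_tokens)
```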
token_limit_exceeded = total_tokens >= token_limit

# Check if we've exceeded max unsummarized messages
messages_since_summary = len(self._context.messages) - 1
nit: do you want to count the possible initial system message towards the limit? if not, you might have to subtract 2, no?
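One way to read this nit, sketched below; whether the initial system message should count toward the limit is exactly the open question, so treat this as illustrative only.

```python
from typing import Any, Dict, List


def messages_since_last_summary(messages: List[Dict[str, Any]]) -> int:
    """Sketch: the first '- 1' mirrors the diff above; the extra decrement
    answers the nit by also excluding an initial system message."""
    count = len(messages) - 1
    if messages and messages[0].get("role") == "system":
        count -= 1
    return count
```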
filter_incomplete_user_turns: bool = False
user_turn_completion_config: Optional[UserTurnCompletionConfig] = None
enable_context_summarization: bool = False
context_summarization_config: Optional[LLMContextSummarizationConfig] = None
What is the reason for adding this into the user aggregator instead of the assistant one? Just curious.
Actually, it seems this should be done in the assistant aggregator; that feels more natural, I think.
I decided to include it there because each time the user started a new turn and we pushed a new context frame, it seemed like a good moment to also push, as a follow-up, a frame requesting summarization if needed.
This way, the LLM would have time to process it while the TTS was generating the previous answer.
So it felt like a good spot to add this logic without impacting performance.
But as we discussed on Slack, we can achieve something similar using the assistant aggregator, if we use the LLMFullResponseStartFrame to decide whether or not to request the context summarization.
Done.
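A sketch of the agreed trigger point: check on LLMFullResponseStartFrame in the assistant aggregator whether a summarization request should be pushed. The class and helper names are placeholders, not the PR's exact methods.

```python
from pipecat.frames.frames import Frame, LLMFullResponseStartFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class AssistantAggregatorTriggerSketch(FrameProcessor):
    """Sketch: check whether summarization is needed when the LLM response starts."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, LLMFullResponseStartFrame) and self._should_summarize():
            await self._request_summarization()  # pushes an LLMContextSummaryRequestFrame

        await self.push_frame(frame, direction)

    def _should_summarize(self) -> bool:
        """Placeholder: token/message thresholds, as discussed above."""
        ...

    async def _request_summarization(self) -> None:
        ...
```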
src/pipecat/services/llm_service.py
Outdated
logger.debug(f"{self}: Processing summarization request {frame.request_id}")

# Create a background task to generate the summary without blocking
self.create_task(self._generate_summary_task(frame))
We should save a reference to the task and cancel it on cleanup if necessary. We should also call await asyncio.sleep(0) to schedule the task in the event loop.
src/pipecat/services/llm_service.py
Outdated
logger.debug(f"{self}: Processing summarization request {frame.request_id}")

# Create a background task to generate the summary without blocking
self.create_task(self._generate_summary_task(frame))
We still need to keep track of this task. And, since we don't await it, we need to call await asyncio.sleep(0).
Yep, I have just pushed the fix to keep track of it.
But I am not sure why we need this asyncio.sleep(0)?
I haven't added this one yet. Is it really needed?
If we don't cancel the task during interruption, then no, it's not needed. The reason is that if you cancel a task before the task has been started by the event loop (different from created), you will get RuntimeWarnings saying that the task was never awaited. Since we don't cancel the task during interruptions, I think we should be ok.
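Putting this thread together, a sketch of the pattern with plain asyncio (the PR itself uses the processor's create_task helper): keep a reference to the task, yield once so the event loop schedules it, and cancel on cleanup if needed. Names are illustrative.

```python
import asyncio
from typing import Optional


class SummaryTaskOwnerSketch:
    """Sketch: own the background summarization task so it can be cancelled later."""

    def __init__(self) -> None:
        self._summary_task: Optional[asyncio.Task] = None

    async def start_summary_task(self, coro) -> None:
        self._summary_task = asyncio.create_task(coro)
        # Yield control so the event loop actually starts the task before any
        # later cancellation (avoids "task was never awaited" warnings).
        await asyncio.sleep(0)

    async def cleanup(self) -> None:
        if self._summary_task and not self._summary_task.done():
            self._summary_task.cancel()
            try:
                await self._summary_task
            except asyncio.CancelledError:
                pass
        self._summary_task = None
```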
LGTM! 👏
Summary
- enable_context_summarization=True in LLMUserAggregatorParams with customizable thresholds and behavior

Key Features
Configuration
Testing
Run the new test suite:
Try the examples:
Implementation Details
New Components:
- src/pipecat/utils/context/llm_context_summarization.py: Core utility with token estimation, message selection, and formatting
- LLMContextSummaryRequestFrame and LLMContextSummaryResultFrame: New control frames for async summarization flow
- LLMContextSummarizationConfig: Configuration dataclass with validation

Modified Components:
- LLMUserAggregator: Added summarization trigger logic, state tracking, and result handling
- LLMService: Added async summary generation using run_inference() with max_tokens override
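A hedged usage sketch based on the names in this summary; the import path for LLMContextSummarizationConfig and the exact field names are assumptions and may differ in the merged code.

```python
# Paths and field names for classes added by this PR are assumptions; adjust to the merged layout.
from pipecat.processors.aggregators.llm_response import LLMUserAggregatorParams
from pipecat.utils.context.llm_context_summarization import LLMContextSummarizationConfig

params = LLMUserAggregatorParams(
    enable_context_summarization=True,
    context_summarization_config=LLMContextSummarizationConfig(
        # Illustrative values; thresholds are approximate token counts,
        # per the review discussion above.
        max_context_tokens=8000,       # approximate threshold that triggers summarization
        target_context_tokens=4000,    # target context size after summarization
        min_messages_to_keep=4,        # recent messages preserved uncompressed
    ),
)
```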